Generating long pieces of music is a challenging problem, as music contains structure at multiple timescales, from millisecond timings to motifs to phrases to the repetition of entire sections. In this project we trained two models on the Bach chorales dataset to generate Bach-like music. This is an exercise from Chapter 15 of the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Géron. The exercise is as follows:
Download the Bach chorales dataset and unzip it. It is composed of 382 chorales composed by Johann Sebastian Bach. Each chorale is 100 to 640 time steps long, and each time step contains 4 integers, where each integer corresponds to a note's index on a piano (except for the value 0, which means that no note is played). Train a model—recurrent, convolutional, or both—that can predict the next time step (four notes), given a sequence of time steps from a chorale. Then use this model to generate Bach-like music, one note at a time: you can do this by giving the model the start of a chorale and asking it to predict the next time step, then appending these time steps to the input sequence and asking the model for the next note, and so on.
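The autoregressive loop described above can be sketched as follows. This is a minimal illustration, not our actual implementation: `model_predict` is a hypothetical stand-in for whatever trained model produces the next time step.

```python
def generate(model_predict, seed, n_steps):
    """Autoregressive generation: predict the next time step, append it,
    and feed the grown sequence back into the model."""
    sequence = list(seed)
    for _ in range(n_steps):
        next_step = model_predict(sequence)  # hypothetical trained model
        sequence.append(next_step)
    return sequence
```

With a dummy model that returns the last value plus one, `generate(lambda s: s[-1] + 1, [1, 2], 3)` returns `[1, 2, 3, 4, 5]`.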
Along with the CNN model suggested in the exercise, we implemented two models for generating music: a CNN+LSTM model and a Transformer model.
A musical piece often consists of recurring elements at various levels, from motifs to phrases to sections such as verse-chorus. To generate a coherent piece, a model needs to reference elements that came before, sometimes in the distant past, repeating, varying, and further developing them to create contrast and surprise. But before we delve into the technical implementation, let us understand the building blocks of music:
Labels of the notes are (in sharp, #, notation):

```
 C# D#    F# G# A#
C  D  E  F  G  A  B ...
```

Labels of the notes are (in flat, $\flat$, notation):

```
 Db Eb    Gb Ab Bb
C  D  E  F  G  A  B ...
```
The A in the 4th octave is typically tuned at 440 Hz
Frequency of note is implemented as:
$$f = f_{\mathrm{A4}} \left( \sqrt[12]{2} \right)^{N}$$
where $N$ is the number of semitone steps needed (possibly negative) to move from A4 to the desired note.
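As a quick sanity check, the formula can be evaluated directly. This is a minimal sketch; `note_frequency` is our own helper name, not part of the project code:

```python
def note_frequency(steps_from_a4, f_a4=440.0):
    """Frequency of the note N semitone steps away from A4 (N may be negative)."""
    return f_a4 * (2 ** (1 / 12)) ** steps_from_a4

print(round(note_frequency(0), 2))    # A4 -> 440.0
print(round(note_frequency(12), 2))   # A5 -> 880.0
print(round(note_frequency(-9), 2))   # C4 (middle C) -> 261.63
```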
Scale: A scale is a selection of notes that fit well together.
Chords: A chord is any harmonic set of pitches/frequencies consisting of multiple notes that are heard as if sounding simultaneously.
Arpeggio: An arpeggio is a type of broken chord in which the notes that compose a chord are individually sounded in a progressive rising or descending order.
Now that we know the building blocks of music, let us understand how we can generate music with deep learning. We take a language-modeling approach to training generative models for symbolic music. Hence we represent music as a sequence of discrete tokens, with the vocabulary determined by the dataset.
The JSB Chorales dataset consists of four-part scored choral music, which can be represented as a matrix where each row is one discretized time step (a chord of four simultaneous notes) and each column is one voice. The matrix's entries are integers that denote which pitch is being played. Notes range from 36 (C2 in MIDI numbering, where 69 = A4) to 81 (A5), plus 0 for silence.
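To make the integer encoding concrete, here is a small illustrative helper (our own, not from the dataset or the book) that converts an index to a note name under MIDI-style numbering, where 69 = A4 and the octave number changes at each C:

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(index):
    # MIDI-style numbering: 69 -> "A4"; octave increments at each C.
    octave = index // 12 - 1
    return f"{NOTE_NAMES[index % 12]}{octave}"

print(note_name(36), note_name(69), note_name(81))  # C2 A4 A5
```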
This is very similar to time-series data or word-sequence data in NLP, so we took a sequence-to-sequence modeling approach to generating the output note sequences. Each chorale is treated as one long sequence of notes (rather than chords), and we train a model to predict the next note given all the previous notes. We feed a window to the neural network, and it tries to predict that same window shifted one time step into the future.
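The windowing step can be sketched as follows. This is a minimal NumPy version of the idea (`make_windows` is our own helper name); the real training pipeline may also batch and shuffle the windows:

```python
import numpy as np

def make_windows(sequence, window_size):
    # Target is the input window shifted one step into the future.
    X, y = [], []
    for i in range(len(sequence) - window_size):
        X.append(sequence[i : i + window_size])
        y.append(sequence[i + 1 : i + window_size + 1])
    return np.array(X), np.array(y)

X, y = make_windows(list(range(10)), window_size=4)
print(X[0], y[0])  # [0 1 2 3] [1 2 3 4]
```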
To recap, the dataset is composed of 382 chorales by Johann Sebastian Bach. Each chorale is 100 to 640 time steps long, and each time step contains four integers, where each integer corresponds to a note's index on a piano (except for the value 0, which means that no note is played).
The dataset is available here: https://github.com/ageron/data/tree/main/jsb_chorales
In our experiments the CNN+LSTM model and the Transformer model achieved accuracy scores of 0.815 and 0.812, respectively. Although both scores are in the same range, the generated samples suggest that on long sequences the Transformer model may lose long-term coherence, as shown in the following graphs:
Fig: Generated chorale by CNN+LSTM model
Fig: Generated chorale by Transformer model
While the Transformer allows us to capture self-reference through attention, it relies on absolute timing signals and thus has a hard time keeping track of regularity that is based on relative distances, event orderings, and periodicity.
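The absolute timing signals in question are the standard sinusoidal positional encodings of the original Transformer: each position receives a fixed vector, so relative distances and periodicity must be inferred indirectly. A minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Position -> fixed vector: sin on even dimensions, cos on odd ones.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / 10000 ** (2 * (i // 2) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

enc = sinusoidal_positions(seq_len=64, d_model=16)
print(enc.shape)  # (64, 16)
```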
The Chapter 15 notebook from the book was used as a reference to implement the CNN model; we then developed a Transformer model for the same task.
The notebook was run on Kaggle with a P100 GPU accelerator (30 GB RAM).
```python
import os
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

tf.keras.utils.get_file(
    "jsb_chorales.tgz",
    "https://github.com/ageron/data/raw/main/jsb_chorales.tgz",
    cache_dir=".",
    extract=True)
```
```
Downloading data from https://github.com/ageron/data/raw/main/jsb_chorales.tgz
117793/117793 [==============================] - 0s 0us/step
'./datasets/jsb_chorales.tgz'
```
```python
train_files = os.listdir('datasets/jsb_chorales/train')
print("This training folder contains {len_folder} file(s).".format(len_folder=len(train_files)))
valid_files = os.listdir('datasets/jsb_chorales/valid')
print("This validation folder contains {len_folder} file(s).".format(len_folder=len(valid_files)))
test_files = os.listdir('datasets/jsb_chorales/test')
print("This test folder contains {len_folder} file(s).".format(len_folder=len(test_files)))
```
```
This training folder contains 229 file(s).
This validation folder contains 76 file(s).
This test folder contains 77 file(s).
```
```python
sample_df = pd.read_csv("datasets/jsb_chorales/train/chorale_201.csv")
sample_df
```
|     | note0 | note1 | note2 | note3 |
|---|---|---|---|---|
| 0 | 71 | 67 | 64 | 52 |
| 1 | 71 | 67 | 64 | 52 |
| 2 | 71 | 67 | 64 | 52 |
| 3 | 71 | 67 | 64 | 52 |
| 4 | 71 | 66 | 59 | 50 |
| ... | ... | ... | ... | ... |
| 299 | 64 | 59 | 56 | 52 |
| 300 | 64 | 59 | 56 | 52 |
| 301 | 64 | 59 | 56 | 52 |
| 302 | 64 | 59 | 56 | 52 |
| 303 | 64 | 59 | 56 | 52 |
304 rows × 4 columns
```python
jsb_chorales_dir = Path("datasets/jsb_chorales")
train_files = sorted(jsb_chorales_dir.glob("train/chorale_*.csv"))
valid_files = sorted(jsb_chorales_dir.glob("valid/chorale_*.csv"))
test_files = sorted(jsb_chorales_dir.glob("test/chorale_*.csv"))

def load_chorales(filepaths):
    return [pd.read_csv(filepath).values.tolist() for filepath in filepaths]

train_chorales = load_chorales(train_files)
valid_chorales = load_chorales(valid_files)
test_chorales = load_chorales(test_files)

print("Length of train chorales: ", len(train_chorales))
train_chorales[0]
```
Length of train chorales: 229
[[74, 70, 65, 58], [74, 70, 65, 58], [74, 70, 65, 58], [74, 70, 65, 58], [75, 70, 58, 55], [75, 70, 58, 55], [75, 70, 60, 55], [75, 70, 60, 55], [77, 69, 62, 50], [77, 69, 62, 50], [77, 69, 62, 50], [77, 69, 62, 50], [77, 70, 62, 55], [77, 70, 62, 55], [77, 69, 62, 55], [77, 69, 62, 55], [75, 67, 63, 48], [75, 67, 63, 48], [75, 69, 63, 48], [75, 69, 63, 48], [74, 70, 65, 46], [74, 70, 65, 46], [74, 70, 65, 46], [74, 70, 65, 46], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [74, 70, 65, 46], [74, 70, 65, 46], [74, 70, 65, 46], [74, 70, 65, 46], [75, 69, 63, 48], [75, 69, 63, 48], [75, 67, 63, 48], [75, 67, 63, 48], [77, 65, 62, 50], [77, 65, 62, 50], [77, 65, 60, 50], [77, 65, 60, 50], [74, 67, 58, 55], [74, 67, 58, 55], [74, 67, 58, 53], [74, 67, 58, 53], [72, 67, 58, 51], [72, 67, 58, 51], [72, 67, 58, 51], [72, 67, 58, 51], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [72, 69, 65, 53], [74, 71, 53, 50], [74, 71, 53, 50], [74, 71, 53, 50], [74, 71, 53, 50], [75, 72, 55, 48], [75, 72, 55, 48], [75, 72, 55, 50], [75, 72, 55, 50], [75, 67, 60, 51], [75, 67, 60, 51], [75, 67, 60, 53], [75, 67, 60, 53], [74, 67, 60, 55], [74, 67, 60, 55], [74, 67, 57, 55], [74, 67, 57, 55], [74, 65, 59, 43], [74, 65, 59, 43], [72, 63, 59, 43], [72, 63, 59, 43], [72, 63, 55, 48], [72, 63, 55, 48], [72, 63, 55, 48], [72, 63, 55, 48], [72, 63, 55, 48], [72, 63, 55, 48], [72, 63, 55, 48], [72, 63, 55, 48], [75, 67, 60, 60], [75, 67, 60, 60], [75, 67, 60, 60], [75, 67, 60, 60], [77, 70, 62, 58], [77, 70, 62, 58], [77, 70, 62, 56], [77, 70, 62, 56], [79, 70, 62, 55], [79, 70, 62, 55], [79, 70, 62, 53], [79, 70, 62, 53], [79, 70, 63, 51], [79, 70, 63, 51], [79, 70, 63, 51], 
[79, 70, 63, 51], [77, 70, 63, 58], [77, 70, 63, 58], [77, 70, 60, 58], [77, 70, 60, 58], [77, 70, 62, 46], [77, 70, 62, 46], [77, 68, 62, 46], [75, 68, 62, 46], [75, 67, 58, 51], [75, 67, 58, 51], [75, 67, 58, 51], [75, 67, 58, 51], [75, 67, 58, 51], [75, 67, 58, 51], [75, 67, 58, 51], [75, 67, 58, 51], [74, 67, 58, 55], [74, 67, 58, 55], [74, 67, 58, 55], [74, 67, 58, 55], [75, 67, 58, 53], [75, 67, 58, 53], [75, 67, 58, 51], [75, 67, 58, 51], [77, 65, 58, 50], [77, 65, 58, 50], [77, 65, 56, 50], [77, 65, 56, 50], [70, 63, 55, 51], [70, 63, 55, 51], [70, 63, 55, 51], [70, 63, 55, 51], [75, 65, 60, 45], [75, 65, 60, 45], [75, 65, 60, 45], [75, 65, 60, 45], [74, 65, 58, 46], [74, 65, 58, 46], [74, 65, 58, 46], [74, 65, 58, 46], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [74, 65, 58, 58], [74, 65, 58, 58], [74, 65, 58, 58], [74, 65, 58, 58], [75, 67, 58, 57], [75, 67, 58, 57], [75, 67, 58, 55], [75, 67, 58, 55], [77, 65, 60, 57], [77, 65, 60, 57], [77, 65, 60, 53], [77, 65, 60, 53], [74, 65, 58, 58], [74, 65, 58, 58], [74, 65, 58, 58], [74, 65, 58, 58], [72, 67, 58, 51], [72, 67, 58, 51], [72, 67, 58, 51], [72, 67, 58, 51], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [72, 65, 57, 53], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46], [70, 65, 62, 46]]
Notes range from 36 (C2 in MIDI numbering, where 69 = A4) to 81 (A5), plus 0 for silence:
```python
notes = set()
for chorales in (train_chorales, valid_chorales, test_chorales):
    for chorale in chorales:
        for chord in chorale:
            notes |= set(chord)

n_notes = len(notes)
min_note = min(notes - {0})
max_note = max(notes)

assert min_note == 36
assert max_note == 81
```
Let's write a few functions to listen to these chorales:
```python
from IPython.display import Audio, display

def notes_to_frequencies(notes):
    # Frequency doubles when you go up one octave; there are 12 semitones
    # per octave; note A on octave 4 is 440 Hz, and it is note number 69.
    return 2 ** ((np.array(notes) - 69) / 12) * 440

def frequencies_to_samples(frequencies, tempo, sample_rate):
    note_duration = 60 / tempo  # the tempo is measured in beats per minute
    # To reduce click sounds at every beat, we round the frequencies to try to
    # get the samples close to zero at the end of each note.
    frequencies = (note_duration * frequencies).round() / note_duration
    n_samples = int(note_duration * sample_rate)
    time = np.linspace(0, note_duration, n_samples)
    sine_waves = np.sin(2 * np.pi * frequencies.reshape(-1, 1) * time)
    # Remove all notes with frequencies ≤ 9 Hz (includes note 0 = silence)
    sine_waves *= (frequencies > 9.).reshape(-1, 1)
    return sine_waves.reshape(-1)

def chords_to_samples(chords, tempo, sample_rate):
    freqs = notes_to_frequencies(chords)
    freqs = np.r_[freqs, freqs[-1:]]  # make the last note a bit longer
    merged = np.mean([frequencies_to_samples(melody, tempo, sample_rate)
                      for melody in freqs.T], axis=0)
    n_fade_out_samples = sample_rate * 60 // tempo  # fade out the last note
    fade_out = np.linspace(1., 0., n_fade_out_samples)**2
    merged[-n_fade_out_samples:] *= fade_out
    return merged

def play_chords(chords, tempo=160, amplitude=0.1, sample_rate=44100, filepath=None):
    samples = amplitude * chords_to_samples(chords, tempo, sample_rate)
    if filepath:
        from scipy.io import wavfile
        samples = (2**15 * samples).astype(np.int16)
        wavfile.write(filepath, sample_rate, samples)
        return display(Audio(filepath))
    else:
        return display(Audio(samples, rate=sample_rate))
```
Now let's listen to a few chorales:
```python
for index in range(3):
    play_chords(train_chorales[index], tempo=200, amplitude=0.5, sample_rate=44100)
```